Skip to content

Conversation

@pmocz
Copy link
Member

@pmocz pmocz commented Mar 2, 2025

by Vincent Vanlaer

@warrickball
Copy link
Contributor

I decided that I'll fight with ifort after the build system is refreshed, so I configured our University cluster to run using the SDK. It submitted test results over the course of last night. There are 25 failures but I want to get some big picture input on that result.

tl;dr: We have two options:

  1. We could probably go back to whatever compiler options meant that stuff runs under the current build system.
  2. We can try to fix a lot of issues that have been raised first.

I'm strongly in favour of 2. I think these runs have flagged some genuine issues that are worth fixing now rather than sweeping back under a rug of compiler options. Or maybe some of these are fixed on main or just missing one or two reasonable compiler flags?

Regarding those issues, they actually fall into some large categories.

  • double_bh, 15M_dynamo, accreted_material_j, high_rot_darkening and ppisn fail because of a checksum difference in the final model when it's computed from a restart. This is presumably a symptom of something not being correctly passed into and out of the photo, though I can't tell at a glance if it's the same thing.
  • 1M_thermohaline, 20M_z2m2_high_rotation, conserve_angular_momentum and magnetic_braking hit a shared issue, where it looks like s% D_ST(2) is NaN on restart. I guess this is also something not being restored from a photo correctly.
  • The RSP tests all fail because of the same energy conservation test.
  • The astero failures are all caused by index out of bounds errors in ADIPLS. I'll make investigating these my next priority.

The remaining tests (carbon_kh, relax_composition_j_entropy and simplex_solar_calibration) have unique issues. simplex_solar_calibration has an unusual structure that includes a bit of astero that BillP isolated. I think the build system just needs updating here.

@warrickball
Copy link
Contributor

Also probably worth mentioning that this is a continuation of this original external PR (#724).

@VincentVanlaer VincentVanlaer force-pushed the VincentVanlaer/build-system branch from fecf573 to 971fa3c Compare April 16, 2025 18:42
@VincentVanlaer
Copy link
Collaborator

I have rebased on main cleaned up some things (e.g. extracted #808).

There are some test cases which run into issues. From what I could see running tests locally (i.e. without checksums), the ccsn, all rsp tests, and all adipls tests fail with FPEs. While these can easily be fixed by initializing certain variables to zero at somewhat arbitrary points, it doesn't feel like a clean solution. Moreover, adipls is not failing on some uninitialized variable, but on an actual divide by zero.

I actually prefer going for option one, I have added a commit that disables FPEs by default. Syncing this PR with main was rather painful, so I would rather see this merged sooner than later (as long as there are no outstanding issues of course). Since the option to enabled FPEs still exists, getting the tests/other parts of MESA to cooperate can still be done in the future.

I am running the test suite on the local cluster with the latest changes, and I'll see whether any other issues crop up. This is without the checksums, since they are not in the repository.

@VincentVanlaer
Copy link
Collaborator

Something I forgot to add is that the default insertion of NaNs could have also been a reason for the difference in the photos.

@VincentVanlaer VincentVanlaer force-pushed the VincentVanlaer/build-system branch from ccbe028 to d248833 Compare April 17, 2025 22:12
@VincentVanlaer VincentVanlaer force-pushed the VincentVanlaer/build-system branch 2 times, most recently from 7adde5f to ac4954d Compare May 16, 2025 14:00
@VincentVanlaer
Copy link
Collaborator

With the latest changes, this should be almost ready to merge.

All tests pass except for the OP mono opacity tests. It seems like they aren't properly tested in main (I presume due to the necessary data files not being present), but local runs on main show the same issue.

@pmocz I noticed that your test runner does not always succeed to compile, but no logs were available. Since I force pushed, it hasn't revisited this branch yet.

What remains to be done is a double check that everything still works (i.e. just do a install) on MacOS, older linux OSes, ... If all of that is good, then this can be merged.

@Debraheem
Copy link
Member

Awesome @VincentVanlaer, can you add a very clear changelog and perhaps update the installation and development documentation? I'm still not sure how the new build system affects my developement, nor how I should go about doing development with the new installation let alone how to install MESA or a build a module on its own and export it etc... I'm sure there are some further details we that need fleshed out before merging, but these are a couple of them off the top of my head.

@VincentVanlaer
Copy link
Collaborator

In terms of install, nothing has changed. The script itself has been changed to work with the new build system, but it should otherwise behave the same. Good point w.r.t. changelog, I did indeed forget that. In terms of documentation for developers, there is already a bunch of documentation in this PR on how the new build system works with also information on how to use a module in another project.

@warrickball
Copy link
Contributor

I've have a tiny bit of capacity for some MESA work and this PR is my priority. I've just built this branch on our cluster with the MESA SDK (on RHEL 8.10) and have the OP mono opacities (as I always do when testing on this cluster, bluebear_...). I'll run the two failing cases locally too and see if I can find a quick solution.

BTW, thanks for fixing the ADIPLS failures. They were my first goal based on when I last looked through the results but it seems you've fixed them in the meantime!

@VincentVanlaer
Copy link
Collaborator

That would be great :) If you fix those test cases, I think it would make most sense to submit them as a separate PR, since (or at least that was the case in my testing) they are also broken on main.

@warrickball
Copy link
Contributor

That would be great :) If you fix those test cases, I think it would make most sense to submit them as a separate PR, since (or at least that was the case in my testing) they are also broken on main.

Just to confirm that you're absolutely right: these fail on main. See e.g. these bluebear results.

So I think this looks ready to merge? It's a big change so we probably want an extra set of eyes, especially from the build side. @rhdtownsend?

@warrickball
Copy link
Contributor

Ah, I now realise I haven't reviewed the actual code fixes. I'll have a look as soon as I get a chance and put in a proper GitHub review.

@Debraheem
Copy link
Member

"there is already a bunch of documentation in this PR on how the new build system works with also information on how to use a module in another project."

Can this information make its way into some excerpt on the mesadocs website? Ultimately, anything explained inside this pr will be lost to history in due time. Perhaps under "modules", there can be a small section titled "build system" that rehashes this.

@Debraheem
Copy link
Member

Debraheem commented May 21, 2025

Super happy to see the new build system reach it's final form!

Given the MESA release candidate is coming very soon. At the moment, I am hesitant toward merging this in until after the release comes out, unless we delay the release. Imo, the new build system should live in main for some months before making into a new release. Are my concerns valid, and or @VincentVanlaer is your preferred timeline such that you'd prefer this to make it immediately into the next release? Am i just fearmongering? let me know.

Perhaps, this is something @VincentVanlaer can discuss with mesa-dev.

@pmocz
Copy link
Member Author

pmocz commented May 21, 2025

@VincentVanlaer have you been able to build on MacOS? I get a

make/defaults-module.mk:6: *** extracting `make` directory based on defaults-module.mk location failed.  Stop.

adipls has too much UB, plaster over it by disabling optimizations

use pointers instead of linker ordering for adipls callbacks
The preprocessor is not functional yet, but that is orthogonal to the
build system changes
The builder codes need some fixing before they can be used.
Plotter does not work with default inlist (and encounters NaNs in eos)
Work directories require a bit more massaging to work:

- The compilation of the work directory needs to be able to find the
  compiled artefacts in the main mesa build directory, but store its own
  output in the build directory in the work directory. This is
  accomplished by keeping the main build directory pointed to the MESA
  directory, and only set BUILD_DIRECTORY_MODULE to the work dir build
  directory, as only this one is used for the generated files.
- The binary module needs to refer to sources in the MESA directory.
  this is because these sources cannot be compiled in during the
  installation, as they depend on functions in the work directory
  sources.

From a user perspective, the main changes are:

- The removal of the `clean` and `mk` scripts. These are replaced by
  `make clean` and `make` respectively.
- The star executable will always be built before it is ran. This will
  eliminate issues where the one forgets to rebuild the executable.
How to actually use the preprocessor is left as an exercise to the
reader. This is to make the bit-rot not be worse than it already is.
There is more where this came from (e.g. RMIN). However, RMIN
initialized to garbage is load bearing for the rest of the code, so
dealing with this is postponed to another time.
Reduces the need for error checking later.
@VincentVanlaer VincentVanlaer force-pushed the VincentVanlaer/build-system branch from 93f91f8 to b2326f5 Compare October 28, 2025 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants